A Phonetic Analysis of Natural Laughter, for Use in Automatic Laughter Processing Systems
Abstract
In this paper, we present the detailed phonetic annotation of the publicly available AVLaughterCycle database, which can readily be used for automatic laughter processing (analysis, classification, browsing, synthesis, etc.). As a first step, the phonetic annotation is used here to analyze the database. Unsurprisingly, we find that h-like phones and central vowels are the most frequent sounds in laughter. However, laughs can contain many other sounds. In particular, nareal fricatives (voiceless friction in the nostrils) are frequent in both inhalation and exhalation phases. We show that the airflow direction (inhaling or exhaling) significantly changes the duration of laughter sounds. Individual differences in the choice of phones and their duration are also examined. The paper concludes with some perspectives opened by the annotated database.

1 Motivation and Related Work

Laughter is an important emotional signal in human communication, and over the last decades it has received growing attention from researchers. While we still do not understand exactly why we laugh, progress has been made in understanding what laughter brings us (enhanced mood, reduced stress, and other health benefits [2, 14]) and in describing how we laugh (see [1, 5, 17, 19]). This paper focuses on the latter aspect, laughter description, with the aim of improving automatic laughter processing. In particular, we mainly consider the acoustic aspects.

Bachorowski et al. [1] were the first to report extensively on the acoustic features of human laughter. They classified laughs into three broad groups: song-like, snort-like and grunt-like. They also labeled the syllables constituting these laughs as voiced or unvoiced, and analyzed several features (duration, pitch, formants) over syllables and whole laughs. They found that mainly central vowels are used in laughter and that the fundamental frequency can take extreme values compared to speech. More generally, laughter has been identified as a highly variable phenomenon. Chafe [5] illustrates the variety of its shapes and sounds with the help of acoustic features (voicing, pitch, energy, etc.).

However, despite the numerous terms used in the literature to describe laughter (see the summary given by Trouvain [21]), there is currently no standard for laughter annotation. Phonetic transcriptions appear in a few laughter-related papers (see [7, 16]) but, to our knowledge, no large laughter database has been annotated that way. For example, the two most used natural laughter databases, the ICSI [9] and AMI [4] Meeting Corpora, do not include detailed laughter annotation (only the presence of laughter in a speech turn is indicated). The ICSI Meeting Corpus contains around 72 hours of audio recordings from 75 meetings; the AMI Meeting Corpus consists of 100 hours of audiovisual recordings of meetings. Both databases contain a large amount of spontaneous, conversational laughter (108 minutes in the 37 ICSI recordings used in [22]).

With the development of intelligent human-computer interfaces, the need for emotional speech understanding and synthesis has emerged, and interest in laughter processing has increased accordingly. Several teams have developed automatic laughter recognition systems. In [10, 22], classifiers were trained to discriminate between laughter and speech using spectral and prosodic features; reported Equal Error Rates (EER) were around 10%. The local decision was improved in [11] thanks to long-term features, lowering the EER to a few percent.
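The cited systems share a common pipeline: spectral features, a trained classifier, evaluation by EER. The sketch below is illustrative only and is not the method of [10, 22]: the feature vectors are synthetic stand-ins for the MFCC-like spectral features a real system would extract from audio frames, and the classifier choice (an SVM) is ours.

```python
# Illustrative sketch of laughter/speech discrimination evaluated with the
# Equal Error Rate. Feature extraction is stubbed with synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-frame spectral feature vectors (13 dims).
laughter = rng.normal(loc=0.5, scale=1.0, size=(500, 13))
speech = rng.normal(loc=-0.5, scale=1.0, size=(500, 13))
X = np.vstack([laughter, speech])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = laughter

# Simple train/test split and a discriminative classifier.
idx = rng.permutation(len(y))
train, test = idx[:700], idx[700:]
clf = SVC(probability=True).fit(X[train], y[train])
scores = clf.predict_proba(X[test])[:, 1]

# EER: the operating point where the false-positive rate equals the
# false-negative rate (1 - true-positive rate).
fpr, tpr, _ = roc_curve(y[test], scores)
eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]
print(f"EER ~ {eer:.3f}  (1 - EER ~ {1.0 - eer:.3f} accuracy at that point)")
```

The final line illustrates the relation noted in the footnote below: 1 − EER can be read as an accuracy figure at the equal-error operating point, with no guarantee that it is the best the system can achieve.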
Recently, Petridis and Pantic [15] combined audio and visual features to separate speech from voiced and unvoiced laughter with 75% accuracy¹. No method has been designed to automatically label laughs, classify them into finer categories than simply voiced or unvoiced, or segment long laughter episodes into laughter “bouts” (exhalation phases separated by inhalations).

¹ Accuracy and Equal Error Rates cannot be directly compared. However, 1 − EER is a measure of accuracy, with no guarantee that it is the best the system can achieve.

A few researchers have also investigated laughter synthesis. Sundaram and Narayanan [18] modeled the energy envelope with a mass-spring analogy and synthesized the vowel sounds of laughter using linear prediction. Lasarcyk and Trouvain [13] compared synthesis by diphone concatenation with 3D modeling of the vocal tract. Unfortunately, in neither case were the obtained laughs perceived as natural by naive listeners. A recent online survey [6] confirmed that no laughter synthesis technique currently reaches a high degree of naturalness.

In previous work, we developed an avatar able to join in laughing with its conversational partner [24]. However, the laughs produced by the virtual agent were not synthesized but selected from an audiovisual laughter database, based on acoustic similarity to the conversational partner’s laughs.

We strongly believe that both automatic laughter recognition/characterization and synthesis would benefit from a detailed phonetic transcription of laughter. On the recognition side, transcriptions can help classify laughs, either on a simple phonetic basis or via features easily computed once the phonetic segmentation is available: syllabic rhythm, exhalation and inhalation phases, acoustic evolution over laughter syllables or bouts, etc. (a small sketch of such feature computation is given at the end of this section). On the synthesis side, transcription enables approaches similar to those used in speech synthesis: training a system on the individual phonetic units and then synthesizing any consistent phonetic sequence.

In this paper, we present the phonetic annotation of the AVLaughterCycle database [23], which is currently the only large (1 hour of laughs) spontaneous laughter database to include audio, video and phonetic transcriptions. In addition, we use these phonetic transcriptions to study two factors of variability that have received little attention in previous work: the airflow direction and personal style. The annotation process is explained in Section 2. Section 3 presents the most frequent phones in exhalation and inhalation phases and shows differences in their duration. Section 4 focuses on individual differences in the phones used and in their durations. Finally, conclusions are given in Section 5, including the perspectives opened by this large phonetically annotated database, which lays the groundwork for further developments in laughter processing.
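To make the recognition-side argument concrete, here is a minimal sketch of features that become trivial once a phonetic segmentation exists. The segment-list format, the airflow labels, and the laugh itself are invented for illustration; they are not the AVLaughterCycle annotation format.

```python
# A minimal sketch, assuming a hypothetical segmentation format:
# (phone label, start time in s, end time in s, airflow direction).
from collections import defaultdict

segments = [
    ("h", 0.00, 0.12, "exhale"),
    ("a", 0.12, 0.30, "exhale"),
    ("h", 0.30, 0.41, "exhale"),
    ("a", 0.41, 0.62, "exhale"),
    ("nareal_fricative", 0.62, 0.95, "inhale"),
]

# Syllabic rhythm proxy: phones per second over the whole laugh.
total = segments[-1][2] - segments[0][1]
print(f"phone rate: {len(segments) / total:.1f} phones/s")

# Total duration and mean phone length per airflow direction.
durations = defaultdict(list)
for phone, start, end, airflow in segments:
    durations[airflow].append(end - start)
for airflow, lengths in durations.items():
    print(f"{airflow}: total {sum(lengths):.2f} s, "
          f"mean phone {sum(lengths) / len(lengths):.2f} s")
```

With real annotations, the same loop extends directly to per-bout statistics or to the exhalation/inhalation duration comparisons of the kind reported in Sections 3 and 4.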
Similar Papers
Evaluating automatic laughter segmentation in meetings using acoustic and acoustic-phonetic features
In this study, we investigated automatic laughter segmentation in meetings. We first performed laughter-speech discrimination experiments with traditional spectral features and subsequently used acoustic-phonetic features. In segmentation, we used Gaussian Mixture Models that were trained with spectral features. For the evaluation of the laughter segmentation we used time-weighted Detection Error...
Acoustic Features of Four Types of Laughter in Natural Conversational Speech
This paper presents the results of an analysis of the representative sounds of human laughter from a large corpus of naturally-occurring conversational speech. Two contrasting manners of laughter were categorized for the study: polite formal laughs and sincere mirthful laughs, and a formant analysis was performed on four phonetic classes of laugh therein. Laughing speech was also common in the ...
Analysis of the occurrence of laughter in meetings
Automatic speech understanding in natural multiparty conversation settings stands to gain from parsing not only verbal but also non-verbal vocal communicative behaviors. In this work, we study the most frequently annotated non-verbal behavior, laughter, whose detection has clear implications for speech understanding tasks, and for the automatic recognition of affect in particular. To complement...
Demonstrating Laughter Detection in Natural Discourses
This work focuses on the demonstration of previously achieved results in the automatic detection of laughter from natural discourses. In the previous work, features of two different modalities, namely audio and video from unobtrusive sources, were used to build a system of recurrent neural networks called Echo State Networks to model the dynamics of laughter. This model was then again utilized t...
Segmenting Phonetic Units in Laughter
Laughter as an every-day, human-specific, affective, nonverbal vocalisation has attracted researchers from many disciplines. One consequence of this multi-disciplinarity is that both the segmentation of the acoustic signal of laughter and the terminology used are heterogeneous and partially contradictory. This study tries to analyse the terminological variety from a phonetic perspec...
Publication year: 2011